Enable predicate pushdown for categorical dimension filters #1227
Conversation
Quick note - the intervening commits show where I got a little too aggressive with pushdown, and the snapshots show one of the issues with the "during construction" approach - it's exceedingly difficult to properly push the filter down without duplicating it in the query, because the recursive tracking of which filters got pushed and which ones didn't is super complicated. Some of these pushdown operations are effectively undoing the subquery reducing optimizer, but have no other practical effect.
Future updates will:
- Add some dedicated join + filter test cases to ensure things keep returning the same results in the face of predicate pushdown. We incidentally seem to be covering the currently operating edge cases, but having explicit tests for all of the boundary conditions will be helpful as we restructure.
- Finish consolidating our pushdown logic - including time window adjustments - into the pushdown state object (sketched below)
- Convert our process to an optimizer pass and remove most predicate management from the DataflowPlanBuilder
- Fix up the pushdown rendering so we don't double-render filters without justification
Whether those come before or after basic support for time dimensions remains to be seen, but I think it likely we'll do them first.
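To make the consolidation idea concrete, here's a minimal, hypothetical sketch of a pushdown state object that records which filter specs were actually applied during pushdown, so the node that originally owned a filter can skip re-rendering it. All names here are invented for illustration; they are not the real MetricFlow types.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass(frozen=True)
class WhereFilterSpec:
    """Stand-in for the real WhereFilterSpec; just enough to be hashable."""

    where_sql: str


@dataclass
class PushdownState:
    """Hypothetical consolidated pushdown state.

    Tracks every filter spec that some upstream source node already applied,
    so the node that originally owned the filter can skip re-rendering it
    instead of duplicating the predicate in the query.
    """

    applied_filter_specs: Set[WhereFilterSpec] = field(default_factory=set)

    def mark_applied(self, spec: WhereFilterSpec) -> None:
        self.applied_filter_specs.add(spec)

    def remaining_filters(self, specs: Tuple[WhereFilterSpec, ...]) -> Tuple[WhereFilterSpec, ...]:
        """Return only the filters that were never pushed down and still need rendering."""
        return tuple(spec for spec in specs if spec not in self.applied_filter_specs)


# Example: a source node applies the instant-booking filter during pushdown...
state = PushdownState()
instant = WhereFilterSpec(where_sql="booking__is_instant")
state.mark_applied(instant)
# ...so the measure node later renders nothing for it.
assert state.remaining_filters((instant,)) == ()
```

With something along these lines threaded through the plan builder, the double-render problem reduces to consulting `remaining_filters` at render time instead of recursively reconstructing which predicates got pushed where.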
```sql
) subq_21
WHERE booking__is_instant
) subq_23
```
This is pretty dumb. If the filter is defined on a measure we typically apply it immediately anyway. The only way to not do this is to be more clever about storing recursive state and ensuring that predicates only get pushed past joins.
For now I think this will do, but it's something we should probably resolve when we add support for time filters, as simple queries with time filters will often do an unnecessary pushdown.
All the snapshots and tests look good.
After reading this, I'm still confused as to how we ensure only categorical dimensions get pushed down 🤔 So trying to dig in to understand that before I approve!
```diff
@@ -680,7 +691,9 @@ def _build_plan_for_distinct_values(self, query_spec: MetricFlowQuerySpec) -> Da
     required_linkable_specs, _ = self.__get_required_and_extraneous_linkable_specs(
         queried_linkable_specs=query_spec.linkable_specs, filter_specs=query_level_filter_specs
     )
-    predicate_pushdown_state = PredicatePushdownState(time_range_constraint=query_spec.time_range_constraint)
+    predicate_pushdown_state = PredicatePushdownState(
+        time_range_constraint=query_spec.time_range_constraint, where_filter_specs=query_level_filter_specs
```
Where does this get narrowed down to only categorical dimensions? 🤔
```diff
@@ -22,17 +22,34 @@ FROM (
 -- Join Standard Outputs
 -- Pass Only Elements: ['average_booking_value', 'listing__is_lux_latest', 'booking__is_instant']
 SELECT
```
Unrelated to this PR, but I've been wondering if we can optimize away these interim subqueries that do nothing but select columns 🤔
They're supposed to get optimized away, and I'm not entirely sure why they don't. There may be some column or subquery alias thing that's getting in the way.
Looooks goood!!!
Left a couple of inline nits, nothing blocking!
```python
invalid_element_types = [
    element for element in spec.linkable_elements if element.element_type not in enabled_element_types
]
if len(semantic_models) == 1 and len(invalid_element_types) == 0:
```
Ok, so we only push down filters if ALL the filtered elements are eligible element types. Is that because we don't know if this is an `AND` or an `OR` filter at this point?
Correct, we cannot push down a filter with any invalid element types. The AND vs OR nature of things isn't relevant; it's because we don't have a way to handle those element types. Right now that's just because we haven't implemented handling, but in the future it could be due to a given query being too difficult to manage for a given element type.
For example, agg time dimension filters against a mixture of cumulative and derived offset metric inputs could get very tricky. In those cases we may not be able to push down a where filter with a time dimension.
My expectation is that this will be more refined than clobbering anything that has a time dimension of any kind in it, but for now this definitely works and we can use more finesse later.
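For illustration, here's a simplified, self-contained version of that all-or-nothing eligibility rule. The types are stand-ins for the real spec classes, not the actual MetricFlow definitions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class LinkableElementType(Enum):
    DIMENSION = "dimension"
    TIME_DIMENSION = "time_dimension"
    ENTITY = "entity"


@dataclass(frozen=True)
class LinkableElement:
    element_type: LinkableElementType
    semantic_model_origin: str


@dataclass(frozen=True)
class WhereFilterSpec:
    linkable_elements: Tuple[LinkableElement, ...]


# Only categorical dimensions are pushdown-eligible for now.
ENABLED_ELEMENT_TYPES = {LinkableElementType.DIMENSION}


def is_eligible_for_pushdown(spec: WhereFilterSpec) -> bool:
    """All-or-nothing: a single ineligible element (or a second semantic model
    origin) disqualifies the entire filter, independent of AND/OR structure."""
    semantic_models = {element.semantic_model_origin for element in spec.linkable_elements}
    invalid_element_types = [
        element
        for element in spec.linkable_elements
        if element.element_type not in ENABLED_ELEMENT_TYPES
    ]
    return len(semantic_models) == 1 and len(invalid_element_types) == 0


# A filter mixing a categorical dimension with a time dimension is skipped
# wholesale, even though the categorical half alone would have been safe.
mixed = WhereFilterSpec(
    linkable_elements=(
        LinkableElement(LinkableElementType.DIMENSION, "bookings_source"),
        LinkableElement(LinkableElementType.TIME_DIMENSION, "bookings_source"),
    )
)
assert not is_eligible_for_pushdown(mixed)
```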
```python
eligible_filter_specs_by_model: Dict[SemanticModelReference, Sequence[WhereFilterSpec]] = {}
for spec in where_filter_specs:
    semantic_models = set(element.semantic_model_origin for element in spec.linkable_elements)
    invalid_element_types = [
```
Ok this is the logic I was looking for 👍
```python
]
if len(matching_filter_specs) == 0:
    filtered_nodes.append(source_node)
    continue
```
Does this `continue` actually do anything?
Oh yeah, I think originally I didn't have an `else` here. I've removed the redundant `continue` statement to keep the structure of this consistent with the time range constraint method.
```python
matching_filter_specs = [
    filter_spec
    for filter_spec in eligible_filter_specs
    if all([spec in source_node_specs.linkable_specs for spec in filter_spec.linkable_specs])
```
Is there ever a time where this won't be true, since we get the semantic model from the linkable element above? Or is this just an extra safety check in case something gets misconfigured?
At this time it's a safeguard against something weird happening where a given source node isn't configured correctly. However, I expect this check to become relevant for entities, since they may be defined in multiple semantic models and we need to be able to explicitly allow or disallow pushdown in those cases. If we ever add a pre-joined source node, for example, we might encounter a scenario where the entity and dimension come from different semantic models, and then we couldn't push down past this point (and maybe shouldn't push down to this point, either).
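As a hypothetical illustration of the entity case (all names invented for the example): the subset check is what would stop a filter referencing one entity link path from being pushed to a source node that only resolves the dimension through a different path.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LinkableSpec:
    """Hypothetical stand-in: an element name plus its entity link path."""

    element_name: str
    entity_links: Tuple[str, ...] = ()


def specs_available_at_node(
    filter_linkable_specs: Tuple[LinkableSpec, ...],
    source_node_linkable_specs: Tuple[LinkableSpec, ...],
) -> bool:
    """A filter may only be pushed to a source node if every spec it references,
    including its entity link path, is resolvable at that node."""
    available = set(source_node_linkable_specs)
    return all(spec in available for spec in filter_linkable_specs)


# Example: `user__home_state` filtered via the `user` entity cannot be pushed
# to a source node that only exposes `guest__home_state`, even if both
# dimensions originate from the same semantic model.
bookings_node_specs = (LinkableSpec("home_state", ("guest",)),)
filter_specs = (LinkableSpec("home_state", ("user",)),)
assert not specs_available_at_node(filter_specs, bookings_node_specs)
```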
We now have the ability to push down filter predicates within the DataflowPlan. We start with categorical dimension filters, as they are the simplest.

This change simply tracks the where filters applied at the measure node and pushes all of them down to the construction of the source node for evaluation. At this time a filter is eligible to be applied to the source node if it only contains references to categorical dimensions that originate from the same, singular semantic model definition that feeds into the source node in question.

We do not support time dimensions at this time, as they can cause strange interactions with things like cumulative metrics, which could result in inappropriate input filtering that produces non-obviously censored metric results.

We also do not support entities at this time, as entities may be defined in multiple semantic models, and as such filters must be applied with more care to ensure we are correctly accounting for the entity link paths to the relevant source node, if any, when we apply the filter.

Finally, we are not able to safely push predicates down for the "null value" side of an outer join, which, in practice, restricts us to only doing predicate pushdown to the measure source nodes.

The snapshot test changes for existing snapshots highlight the new behavior, while the added test snapshots demonstrate specific circumstances of interest.
Note for reviewers, as this will be squashed: pushdown on joined-in dimensions was not working as expected. In the original predicate pushdown rendering tests, most joined-in dimensions were being skipped for pushdown operations. The root cause was the semantic model source value we were accessing, which actually included the complete history of all semantic model inputs for the joined-in dimension. Fixing that problem uncovered a separate issue: we were inappropriately pushing filters down past outer join expressions. This commit fixes both issues at once - we now only push down on the "safe" side of an outer join (the left side for LEFT OUTER, and not at all for FULL OUTER joins), and we evaluate pushdown based on the singular semantic model source where each element is defined.
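A minimal sketch of that "safe side" rule, with simplified types standing in for the real join handling:

```python
from enum import Enum


class SqlJoinType(Enum):
    INNER = "INNER JOIN"
    LEFT_OUTER = "LEFT OUTER JOIN"
    FULL_OUTER = "FULL OUTER JOIN"


def branch_allows_pushdown(join_type: SqlJoinType, is_left_branch: bool) -> bool:
    """Predicates may only be pushed into a join branch that cannot be NULL-padded
    by the join itself; pushing a filter into the padded side changes which rows
    survive the join."""
    if join_type is SqlJoinType.LEFT_OUTER:
        # Only the left input is safe; the right side may be padded with NULLs.
        return is_left_branch
    if join_type is SqlJoinType.FULL_OUTER:
        # Either side may be padded, so nothing gets pushed past this join.
        return False
    return True  # INNER joins never pad either side


assert branch_allows_pushdown(SqlJoinType.LEFT_OUTER, is_left_branch=True)
assert not branch_allows_pushdown(SqlJoinType.LEFT_OUTER, is_left_branch=False)
assert not branch_allows_pushdown(SqlJoinType.FULL_OUTER, is_left_branch=True)
```

Filtering the NULL-padded side of an outer join before the join runs produces different survivors than applying the same filter afterward, which is why pushdown stops at these boundaries rather than attempting anything cleverer for now.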